Now that we understand the logic of spatial relationship queries, let’s take a look at another fundamental spatial operation that relies on them.
This operation, called a spatial join, is the process by which we can leverage the spatial relationships between distinct datasets to merge their information into a new output dataset.
This operation can be thought as the spatial equivalent of an attribute join, in which multiple tabular datasets can be merged by aligning matching values in a common column that they both contain. Thus, we’ll start by developing an understanding of this operation first!
Instructor Notes
library(sf)
library(tmap)
Let’s read in a table of data from the US Census’ 5-year American Community Survey (ACS5).
# Read in the ACS5 data for CA into an `sf` object.
# Note: We force the FIPS_11_digit to be read in as a string to preserve any leading zeroes.
acs5_df = read.csv("notebook_data/census/ACS5yr/census_variables_CA.csv")
head(acs5_df)
## NAME c_race c_white c_black c_asian
## 1 Census Tract 4012, Alameda County, California 2456 1287 476 259
## 2 Census Tract 4013, Alameda County, California 3983 845 1348 827
## 3 Census Tract 4014, Alameda County, California 4340 713 1902 593
## 4 Census Tract 4015, Alameda County, California 2080 563 1064 215
## 5 Census Tract 4016, Alameda County, California 1889 324 960 247
## 6 Census Tract 4017, Alameda County, California 2544 553 634 302
## c_latinx c_race_moe c_white_moe c_black_moe c_asian_moe c_latinx_moe
## 1 283 213 191 116 124 195
## 2 796 680 186 411 283 236
## 3 981 644 314 440 198 620
## 4 190 369 222 283 116 105
## 5 274 400 135 376 164 149
## 6 945 314 140 259 144 247
## state_fips county_fips tract_fips med_rent med_hhinc c_tenants c_owners
## 1 6 1 401200 1251 62064 1245 544
## 2 6 1 401300 927 35030 1743 213
## 3 6 1 401400 884 28264 1447 274
## 4 6 1 401500 718 38684 992 449
## 5 6 1 401600 1088 32663 707 140
## 6 6 1 401700 1147 63052 1083 398
## c_renters med_rent_moe med_hhinc_moe c_tenants_moe c_owners_moe c_renters_moe
## 1 701 128 14598 60 94 91
## 2 1530 64 6917 87 69 104
## 3 1173 64 5802 106 75 125
## 4 543 80 12077 82 96 114
## 5 567 79 16290 88 61 90
## 6 685 214 19070 95 85 116
## c_movers c_stay c_movelocal c_movecounty c_movestate c_moveabroad
## 1 2448 1995 253 143 25 32
## 2 3978 2434 1114 252 90 88
## 3 4269 3448 699 76 27 19
## 4 2080 1750 211 112 7 0
## 5 1860 1545 148 153 4 10
## 6 2535 2001 350 116 34 34
## c_movers_moe c_stay_moe c_movelocal_moe c_movecounty_moe c_movestate_moe
## 1 213 188 187 66 39
## 2 680 324 512 115 75
## 3 605 618 246 52 34
## 4 369 353 112 103 11
## 5 393 368 89 149 6
## 6 312 282 193 62 47
## c_moveabroad_moe c_commute c_car c_carpool c_transit c_bike c_walk
## 1 31 1460 805 94 276 122 85
## 2 98 2046 698 223 801 37 214
## 3 26 1595 751 34 408 186 163
## 4 12 982 493 89 226 47 17
## 5 16 603 344 74 107 38 0
## 6 39 1585 648 284 373 21 29
## c_commute_moe c_car_moe c_carpool_moe c_transit_moe c_bike_moe c_walk_moe
## 1 191 144 109 105 60 43
## 2 468 183 133 264 36 121
## 3 396 211 28 179 108 151
## 4 222 145 67 82 34 15
## 5 170 128 74 120 44 12
## 6 225 168 124 141 22 44
## year FIPS_11_digit p_white p_black p_asian p_latinx p_owners
## 1 2013 6001401200 0.5240228 0.1938111 0.1054560 0.11522801 0.4369478
## 2 2013 6001401300 0.2121516 0.3384384 0.2076324 0.19984936 0.1222031
## 3 2013 6001401400 0.1642857 0.4382488 0.1366359 0.22603687 0.1893573
## 4 2013 6001401500 0.2706731 0.5115385 0.1033654 0.09134615 0.4526210
## 5 2013 6001401600 0.1715193 0.5082054 0.1307570 0.14505029 0.1980198
## 6 2013 6001401700 0.2173742 0.2492138 0.1187107 0.37146226 0.3674977
## p_renters p_stay p_movelocal p_movecounty p_movestate p_moveabroad
## 1 0.5630522 0.8149510 0.10334967 0.05841503 0.010212418 0.013071895
## 2 0.8777969 0.6118653 0.28004022 0.06334842 0.022624434 0.022121669
## 3 0.8106427 0.8076833 0.16373858 0.01780276 0.006324666 0.004450691
## 4 0.5473790 0.8413462 0.10144231 0.05384615 0.003365385 0.000000000
## 5 0.8019802 0.8306452 0.07956989 0.08225806 0.002150538 0.005376344
## 6 0.6325023 0.7893491 0.13806706 0.04575937 0.013412229 0.013412229
## p_car p_carpool p_transit p_bike p_walk
## 1 0.5513699 0.06438356 0.1890411 0.08356164 0.05821918
## 2 0.3411535 0.10899316 0.3914956 0.01808407 0.10459433
## 3 0.4708464 0.02131661 0.2557994 0.11661442 0.10219436
## 4 0.5020367 0.09063136 0.2301426 0.04786151 0.01731161
## 5 0.5704809 0.12271973 0.1774461 0.06301824 0.00000000
## 6 0.4088328 0.17917981 0.2353312 0.01324921 0.01829653
Brief summary of the data:
Below is a table of the variables in this table. They were combined from different ACS 5 year tables.
NOTE:
c_ are countsmed_ are medians_moe are margin of error estimatesp_ are proportions calcuated from the counts divided by the table denominator (the total count for whom that variable was assessed)| Variable | Description |
|---|---|
c_race |
Total population |
c_white |
Total white non-Latinx |
c_black |
Total black and African American non-Latinx |
c_asian |
Total Asian non-Latinx |
c_latinx |
Total Latinx |
state_fips |
State level FIPS code |
county_fips |
County level FIPS code |
tract_fips |
Tracts level FIPS code |
med_rent |
Median rent |
med_hhinc |
Median household income |
c_tenants |
Total tenants |
c_owners |
Total owners |
c_renters |
Total renters |
c_movers |
Total number of people who moved |
c_stay |
Total number of people who stayed |
c_movelocal |
Number of people who moved locally |
c_movecounty |
Number of people who moved counties |
c_movestate |
Number of people who moved states |
c_moveabroad |
Number of people who moved abroad |
c_commute |
Total number of commuters |
c_car |
Number of commuters who use a car |
c_carpool |
Number of commuters who carpool |
c_transit |
Number of commuters who use public transit |
c_bike |
Number of commuters who bike |
c_walk |
Number of commuters who bike |
year |
ACS data year |
FIPS_11_digit |
11-digit FIPS code |
| …. | and more |
We’re going to drop all of our moe columns by identifying all of those that end with _moe. We can do that in two steps, first by using filter to identify columns that contain the string _moe.
tidyverse will help with this!
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
acs5_df = acs5_df %>% select(-contains("_moe"))
Unfortunately, when this dataset reads in, the 11-digit FIPS codes that should be strings actually read in as numerics, and thus the leading 0 gets truncated. We’re going to need those FIPS code in the correct format later, so let’s reformat them now.
# recast the FIPS 11-digit codes as strings, pasting a 0 at the front of each
acs5_df$FIPS_11_digit = paste0('0', acs5_df$FIPS_11_digit)
And lastly, let’s grab only the rows for year 2018 and county FIPS code 1 (i.e. Alameda County)
acs5_df_ac = acs5_df[acs5_df$year==2018 & acs5_df$county_fips==1, ]
Now, take another look at the dataframe
head(acs5_df_ac)
## NAME c_race c_white c_black
## 8324 Census Tract 4415.01, Alameda County, California 6570 677 111
## 8325 Census Tract 4047, Alameda County, California 2079 1515 134
## 8326 Census Tract 4425, Alameda County, California 7748 1430 375
## 8327 Census Tract 4503, Alameda County, California 5301 2597 96
## 8328 Census Tract 4506.07, Alameda County, California 5971 2832 324
## 8329 Census Tract 4049, Alameda County, California 3973 2383 394
## c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 8324 4740 570 6 1 441501 1883 152394
## 8325 199 175 6 1 404700 3250 181000
## 8326 3379 1904 6 1 442500 1999 95323
## 8327 1077 1315 6 1 450300 2626 121131
## 8328 1726 804 6 1 450607 1837 96061
## 8329 442 337 6 1 404900 1446 111846
## c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 8324 1796 1634 162 6491 6010 257 68
## 8325 790 680 110 2043 1822 58 77
## 8326 2264 1245 1019 7628 6858 557 29
## 8327 1814 1249 565 5249 4668 134 326
## 8328 2445 883 1562 5905 4735 715 254
## 8329 1874 1095 779 3939 3647 123 100
## c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 8324 129 27 2317 1765 264 127 28 8
## 8325 65 21 1075 572 191 170 7 6
## 8326 23 161 3330 2382 375 214 59 34
## 8327 114 7 2819 2112 183 265 28 33
## 8328 201 0 3315 2316 440 193 71 114
## 8329 34 35 2211 1030 303 522 48 0
## year FIPS_11_digit p_white p_black p_asian p_latinx p_owners
## 8324 2018 06001441501 0.1030441 0.01689498 0.7214612 0.08675799 0.9097996
## 8325 2018 06001404700 0.7287157 0.06445406 0.0957191 0.08417508 0.8607595
## 8326 2018 06001442500 0.1845638 0.04839959 0.4361125 0.24574084 0.5499117
## 8327 2018 06001450300 0.4899076 0.01810979 0.2031692 0.24806640 0.6885336
## 8328 2018 06001450607 0.4742924 0.05426227 0.2890638 0.13465081 0.3611452
## 8329 2018 06001404900 0.5997986 0.09916939 0.1112509 0.08482255 0.5843116
## p_renters p_stay p_movelocal p_movecounty p_movestate p_moveabroad
## 8324 0.09020045 0.9258974 0.03959328 0.010476044 0.019873671 0.004159606
## 8325 0.13924051 0.8918257 0.02838962 0.037689672 0.031815957 0.010279001
## 8326 0.45008834 0.8990561 0.07302045 0.003801783 0.003015207 0.021106450
## 8327 0.31146637 0.8893122 0.02552867 0.062107068 0.021718423 0.001333587
## 8328 0.63885481 0.8018628 0.12108383 0.043014395 0.034038950 0.000000000
## 8329 0.41568837 0.9258695 0.03122620 0.025387154 0.008631632 0.008885504
## p_car p_carpool p_transit p_bike p_walk
## 8324 0.7617609 0.11394044 0.05481226 0.012084592 0.003452741
## 8325 0.5320930 0.17767442 0.15813953 0.006511628 0.005581395
## 8326 0.7153153 0.11261261 0.06426426 0.017717718 0.010210210
## 8327 0.7492018 0.06491664 0.09400497 0.009932600 0.011706279
## 8328 0.6986425 0.13273002 0.05822021 0.021417798 0.034389140
## 8329 0.4658526 0.13704206 0.23609227 0.021709634 0.000000000
Now let’s also read in our census tracts again!
tracts_sf = st_read("./notebook_data/census/Tracts/cb_2018_06_tract_500k.shp", )
## Reading layer `cb_2018_06_tract_500k' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/census/Tracts/cb_2018_06_tract_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 8041 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 42.00948
## Geodetic CRS: NAD83
head(tracts_sf)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.5009 ymin: 37.92535 xmax: -120.3345 ymax: 39.30875
## Geodetic CRS: NAD83
## STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD
## 1 06 009 000300 1400000US06009000300 06009000300 3 CT
## 2 06 011 000300 1400000US06011000300 06011000300 3 CT
## 3 06 013 303102 1400000US06013303102 06013303102 3031.02 CT
## 4 06 013 303202 1400000US06013303202 06013303202 3032.02 CT
## 5 06 013 303203 1400000US06013303203 06013303203 3032.03 CT
## 6 06 013 307102 1400000US06013307102 06013307102 3071.02 CT
## ALAND AWATER geometry
## 1 457009794 394122 MULTIPOLYGON (((-120.764 38...
## 2 952744514 195376 MULTIPOLYGON (((-122.5001 3...
## 3 6507019 0 MULTIPOLYGON (((-121.7294 3...
## 4 3725528 0 MULTIPOLYGON (((-121.7235 3...
## 5 6354210 0 MULTIPOLYGON (((-121.7449 3...
## 6 1603153 0 MULTIPOLYGON (((-121.8226 3...
tracts_sf_ac = tracts_sf[tracts_sf$COUNTYFP == '001',]
plot(tracts_sf_ac$geometry)
Attribute Joins between sf data.frames and plain data.frames
We just mapped the census tracts. But what makes a map powerful is when you map the data associated with the locations.
tracts_sf_ac: These are polygon data in this sf data.frame. However, as we saw with the head command, it contains no attributes of interest!
acs5_df_ac: These are 2018 ACS data attributes for CA census tracts read in from a CSV file (‘census_variables_CA.csv’), into a regular data.frame. However, this data has no geometry column!
In order to map the ACS data we need to associate it with the tracts. We can do that by joining the columns from acs5_df_ac to the columns of tracts_gdf_ac using a common column as the key for matching rows. This process is called an attribute join.
Question
The image above gives us a nice conceptual summary of the types of joins we could run.
(NOTE: You can read more about merging sf and plain data.frames here.)
Okay, here we go!
Let’s take a look at the common column in both our data.frames.
head(tracts_sf_ac['GEOID'])
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.3159 ymin: 37.66497 xmax: -122.1062 ymax: 37.84792
## Geodetic CRS: NAD83
## GEOID geometry
## 27 06001425101 MULTIPOLYGON (((-122.3142 3...
## 28 06001428600 MULTIPOLYGON (((-122.2799 3...
## 29 06001432600 MULTIPOLYGON (((-122.1675 3...
## 30 06001433200 MULTIPOLYGON (((-122.1667 3...
## 31 06001433900 MULTIPOLYGON (((-122.1209 3...
## 32 06001436100 MULTIPOLYGON (((-122.1329 3...
head(acs5_df_ac['FIPS_11_digit'])
## FIPS_11_digit
## 8324 06001441501
## 8325 06001404700
## 8326 06001442500
## 8327 06001450300
## 8328 06001450607
## 8329 06001404900
Note that they are not named the same thing.
That's okay! We just need to know that they contain the same information.
Also note that they are not in the same order.
That's not only okay... That's the point! (If they were in the same order already then we could just join them side by side, without having R find and line up the matching rows from each!)
Let’s do a left join to keep all of the census tracts in Alameda County and only the ACS data for those tracts.
NOTE: To figure out how to do this we could always take a peek at the documentation by calling ?base::merge.
?base::merge
# Left join keeps all tracts and the acs data for those tracts
tracts_acs_sf_ac = base::merge(tracts_sf_ac, acs5_df_ac, by.x = 'GEOID', by.y = "FIPS_11_digit", all.x=TRUE)
# what is the class of the output data object
class(tracts_acs_sf_ac)
## [1] "sf" "data.frame"
What is the class of the output if you join the tracts to the ACS data (i.e reverse the order of the inputs)?
acs_and_tracts_ac = base::merge(acs5_df_ac, tracts_sf_ac, by.y = 'GEOID', by.x = "FIPS_11_digit", all.x=TRUE)
class(acs_and_tracts_ac)
## [1] "data.frame"
ORDER MATTERS with
base::merge
If you use base::merge to join a dataframe to an spatial dataframe the output is a spatial df.
If you use base::merge to join a spatial dataframe to a dataframe the output is as dataframe!
Take a look at the output
# take a look at the output
head(tracts_acs_sf_ac)
## Simple feature collection with 6 features and 52 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS: NAD83
## GEOID STATEFP COUNTYFP TRACTCE AFFGEOID NAME.x LSAD ALAND
## 1 06001400100 06 001 400100 1400000US06001400100 4001 CT 6894340
## 2 06001400200 06 001 400200 1400000US06001400200 4002 CT 586561
## 3 06001400300 06 001 400300 1400000US06001400300 4003 CT 1105851
## 4 06001400400 06 001 400400 1400000US06001400400 4004 CT 715616
## 5 06001400500 06 001 400500 1400000US06001400500 4005 CT 590307
## 6 06001400600 06 001 400600 1400000US06001400600 4006 CT 297856
## AWATER NAME.y c_race c_white c_black
## 1 0 Census Tract 4001, Alameda County, California 3115 2055 128
## 2 0 Census Tract 4002, Alameda County, California 2025 1436 59
## 3 0 Census Tract 4003, Alameda County, California 5000 3458 380
## 4 0 Census Tract 4004, Alameda County, California 3843 2551 229
## 5 0 Census Tract 4005, Alameda County, California 3786 1885 990
## 6 0 Census Tract 4006, Alameda County, California 1638 817 343
## c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 1 592 104 6 1 400100 3501 200893
## 2 183 178 6 1 400200 1902 160536
## 3 535 364 6 1 400300 1481 94732
## 4 373 420 6 1 400400 1624 113036
## 5 264 334 6 1 400500 1627 103846
## 6 144 124 6 1 400600 1640 127232
## c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 1 1297 1158 139 3084 2514 202 120
## 2 855 532 323 2012 1827 40 46
## 3 2441 1057 1384 4948 4159 223 289
## 4 1750 653 1097 3754 3440 113 133
## 5 1622 605 1017 3745 3065 309 180
## 6 657 283 374 1587 1221 152 146
## c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 1 219 29 1542 864 165 271 0 10
## 2 39 60 1211 500 40 426 62 57
## 3 156 121 2975 1252 177 835 202 171
## 4 46 22 2293 933 240 588 188 92
## 5 145 46 2325 983 156 619 177 94
## 6 68 0 1022 370 120 268 93 3
## year p_white p_black p_asian p_latinx p_owners p_renters p_stay
## 1 2018 0.6597111 0.04109149 0.19004815 0.03338684 0.8928296 0.1071704 0.8151751
## 2 2018 0.7091358 0.02913580 0.09037037 0.08790123 0.6222222 0.3777778 0.9080517
## 3 2018 0.6916000 0.07600000 0.10700000 0.07280000 0.4330193 0.5669807 0.8405416
## 4 2018 0.6638043 0.05958886 0.09705959 0.10928962 0.3731429 0.6268571 0.9163559
## 5 2018 0.4978870 0.26148970 0.06973059 0.08821976 0.3729963 0.6270037 0.8184246
## 6 2018 0.4987790 0.20940171 0.08791209 0.07570208 0.4307458 0.5692542 0.7693762
## p_movelocal p_movecounty p_movestate p_moveabroad p_car p_carpool
## 1 0.06549935 0.03891051 0.07101167 0.009403372 0.5603113 0.10700389
## 2 0.01988072 0.02286282 0.01938370 0.029821074 0.4128819 0.03303055
## 3 0.04506871 0.05840744 0.03152789 0.024454325 0.4208403 0.05949580
## 4 0.03010123 0.03542888 0.01225360 0.005860416 0.4068905 0.10466638
## 5 0.08251001 0.04806409 0.03871829 0.012283044 0.4227957 0.06709677
## 6 0.09577820 0.09199748 0.04284814 0.000000000 0.3620352 0.11741683
## p_transit p_bike p_walk geometry
## 1 0.1757458 0.00000000 0.006485084 MULTIPOLYGON (((-122.2469 3...
## 2 0.3517754 0.05119736 0.047068538 MULTIPOLYGON (((-122.2574 3...
## 3 0.2806723 0.06789916 0.057478992 MULTIPOLYGON (((-122.2642 3...
## 4 0.2564326 0.08198866 0.040122111 MULTIPOLYGON (((-122.2618 3...
## 5 0.2662366 0.07612903 0.040430108 MULTIPOLYGON (((-122.2694 3...
## 6 0.2622309 0.09099804 0.002935421 MULTIPOLYGON (((-122.2681 3...
Let’s check that we have all the variables we have in our dataset now.
colnames(tracts_acs_sf_ac)
## [1] "GEOID" "STATEFP" "COUNTYFP" "TRACTCE" "AFFGEOID"
## [6] "NAME.x" "LSAD" "ALAND" "AWATER" "NAME.y"
## [11] "c_race" "c_white" "c_black" "c_asian" "c_latinx"
## [16] "state_fips" "county_fips" "tract_fips" "med_rent" "med_hhinc"
## [21] "c_tenants" "c_owners" "c_renters" "c_movers" "c_stay"
## [26] "c_movelocal" "c_movecounty" "c_movestate" "c_moveabroad" "c_commute"
## [31] "c_car" "c_carpool" "c_transit" "c_bike" "c_walk"
## [36] "year" "p_white" "p_black" "p_asian" "p_latinx"
## [41] "p_owners" "p_renters" "p_stay" "p_movelocal" "p_movecounty"
## [46] "p_movestate" "p_moveabroad" "p_car" "p_carpool" "p_transit"
## [51] "p_bike" "p_walk" "geometry"
Question
It’s always important to run sanity checks on our results, at each step of the way!
In this case, how many rows and columns should we have?
print("Rows and columns in the Alameda County Census tract spatial df:")
## [1] "Rows and columns in the Alameda County Census tract spatial df:"
print(dim(tracts_sf_ac))
## [1] 360 10
print("Row and columns in the ACS5 2018 data for Alameda County:")
## [1] "Row and columns in the ACS5 2018 data for Alameda County:"
print(dim(acs5_df_ac))
## [1] 361 44
print("Rows and columns in the Alameda County Census tract spatial df joined to the ACS data:")
## [1] "Rows and columns in the Alameda County Census tract spatial df joined to the ACS data:"
print(dim(tracts_acs_sf_ac))
## [1] 360 53
Let’s save out our merged data so we can use it in the final notebook.
st_write(tracts_acs_sf_ac, './outdata/tracts_acs_sdf_ac.json', driver='GeoJSON', delete_dsn=T)
## Deleting source `./outdata/tracts_acs_sdf_ac.json' using driver `GeoJSON'
## Writing layer `tracts_acs_sdf_ac' to data source `./outdata/tracts_acs_sdf_ac.json' using driver `GeoJSON'
## Writing 360 features with 52 fields and geometry type Multi Polygon.
We can now make choropleth maps using our attribute joined geodataframe. Go ahead and pick one variable to color the map, then map it using tmap (since it’s too easy using the plot method). You can go back to lesson 5 if you need a refresher on how to make this!
To see the solution, look at the hidden text below.
head(tracts_acs_sf_ac)
## Simple feature collection with 6 features and 52 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS: NAD83
## GEOID STATEFP COUNTYFP TRACTCE AFFGEOID NAME.x LSAD ALAND
## 1 06001400100 06 001 400100 1400000US06001400100 4001 CT 6894340
## 2 06001400200 06 001 400200 1400000US06001400200 4002 CT 586561
## 3 06001400300 06 001 400300 1400000US06001400300 4003 CT 1105851
## 4 06001400400 06 001 400400 1400000US06001400400 4004 CT 715616
## 5 06001400500 06 001 400500 1400000US06001400500 4005 CT 590307
## 6 06001400600 06 001 400600 1400000US06001400600 4006 CT 297856
## AWATER NAME.y c_race c_white c_black
## 1 0 Census Tract 4001, Alameda County, California 3115 2055 128
## 2 0 Census Tract 4002, Alameda County, California 2025 1436 59
## 3 0 Census Tract 4003, Alameda County, California 5000 3458 380
## 4 0 Census Tract 4004, Alameda County, California 3843 2551 229
## 5 0 Census Tract 4005, Alameda County, California 3786 1885 990
## 6 0 Census Tract 4006, Alameda County, California 1638 817 343
## c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 1 592 104 6 1 400100 3501 200893
## 2 183 178 6 1 400200 1902 160536
## 3 535 364 6 1 400300 1481 94732
## 4 373 420 6 1 400400 1624 113036
## 5 264 334 6 1 400500 1627 103846
## 6 144 124 6 1 400600 1640 127232
## c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 1 1297 1158 139 3084 2514 202 120
## 2 855 532 323 2012 1827 40 46
## 3 2441 1057 1384 4948 4159 223 289
## 4 1750 653 1097 3754 3440 113 133
## 5 1622 605 1017 3745 3065 309 180
## 6 657 283 374 1587 1221 152 146
## c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 1 219 29 1542 864 165 271 0 10
## 2 39 60 1211 500 40 426 62 57
## 3 156 121 2975 1252 177 835 202 171
## 4 46 22 2293 933 240 588 188 92
## 5 145 46 2325 983 156 619 177 94
## 6 68 0 1022 370 120 268 93 3
## year p_white p_black p_asian p_latinx p_owners p_renters p_stay
## 1 2018 0.6597111 0.04109149 0.19004815 0.03338684 0.8928296 0.1071704 0.8151751
## 2 2018 0.7091358 0.02913580 0.09037037 0.08790123 0.6222222 0.3777778 0.9080517
## 3 2018 0.6916000 0.07600000 0.10700000 0.07280000 0.4330193 0.5669807 0.8405416
## 4 2018 0.6638043 0.05958886 0.09705959 0.10928962 0.3731429 0.6268571 0.9163559
## 5 2018 0.4978870 0.26148970 0.06973059 0.08821976 0.3729963 0.6270037 0.8184246
## 6 2018 0.4987790 0.20940171 0.08791209 0.07570208 0.4307458 0.5692542 0.7693762
## p_movelocal p_movecounty p_movestate p_moveabroad p_car p_carpool
## 1 0.06549935 0.03891051 0.07101167 0.009403372 0.5603113 0.10700389
## 2 0.01988072 0.02286282 0.01938370 0.029821074 0.4128819 0.03303055
## 3 0.04506871 0.05840744 0.03152789 0.024454325 0.4208403 0.05949580
## 4 0.03010123 0.03542888 0.01225360 0.005860416 0.4068905 0.10466638
## 5 0.08251001 0.04806409 0.03871829 0.012283044 0.4227957 0.06709677
## 6 0.09577820 0.09199748 0.04284814 0.000000000 0.3620352 0.11741683
## p_transit p_bike p_walk geometry
## 1 0.1757458 0.00000000 0.006485084 MULTIPOLYGON (((-122.2469 3...
## 2 0.3517754 0.05119736 0.047068538 MULTIPOLYGON (((-122.2574 3...
## 3 0.2806723 0.06789916 0.057478992 MULTIPOLYGON (((-122.2642 3...
## 4 0.2564326 0.08198866 0.040122111 MULTIPOLYGON (((-122.2618 3...
## 5 0.2662366 0.07612903 0.040430108 MULTIPOLYGON (((-122.2694 3...
## 6 0.2622309 0.09099804 0.002935421 MULTIPOLYGON (((-122.2681 3...
# YOUR CODE HERE
Great! We’ve wrapped our heads around the concept of an attribute join.
Now let’s extend that concept to its spatially explicit equivalent: the spatial join!
To start, we’ll read in some other data: The Alameda County schools data.
Then we’ll work with that data and our tracts_acs_sf_ac data together.
# read in school data from a csv file
schools_df = read.csv('notebook_data/alco_schools.csv')
# promote to a spatial df
schools_sf = st_as_sf(schools_df, coords = c('X', 'Y'), crs=4326)
Let’s check if we have to transform the schools to match thetracts_acs_sf_ac’s CRS.
print('schools_sf CRS:')
## [1] "schools_sf CRS:"
print(st_crs(schools_sf))
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["unknown"],
## AREA["World"],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
print('tracts_acs_sf_ac CRS:')
## [1] "tracts_acs_sf_ac CRS:"
print(st_crs(tracts_acs_sf_ac))
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
Yes we do! Let’s do that.
NOTE: Explicit syntax aiming at that dataset’s CRS leaves less room for human error!
schools_sf = st_transform(schools_sf, st_crs(tracts_acs_sf_ac))
print('schools_sf CRS:')
## [1] "schools_sf CRS:"
print(st_crs(schools_sf))
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
print('tracts_acs_sf_ac CRS:')
## [1] "tracts_acs_sf_ac CRS:"
print(st_crs(tracts_acs_sf_ac))
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
Now we’re ready to combine the datasets in an analysis.
In this case, we want to get data from the census tract within which each school is located.
But how can we do that? The two datasets don’t share a common column to use for a join.
colnames(tracts_acs_sf_ac)
## [1] "GEOID" "STATEFP" "COUNTYFP" "TRACTCE" "AFFGEOID"
## [6] "NAME.x" "LSAD" "ALAND" "AWATER" "NAME.y"
## [11] "c_race" "c_white" "c_black" "c_asian" "c_latinx"
## [16] "state_fips" "county_fips" "tract_fips" "med_rent" "med_hhinc"
## [21] "c_tenants" "c_owners" "c_renters" "c_movers" "c_stay"
## [26] "c_movelocal" "c_movecounty" "c_movestate" "c_moveabroad" "c_commute"
## [31] "c_car" "c_carpool" "c_transit" "c_bike" "c_walk"
## [36] "year" "p_white" "p_black" "p_asian" "p_latinx"
## [41] "p_owners" "p_renters" "p_stay" "p_movelocal" "p_movecounty"
## [46] "p_movestate" "p_moveabroad" "p_car" "p_carpool" "p_transit"
## [51] "p_bike" "p_walk" "geometry"
colnames(schools_sf)
## [1] "Site" "Address" "City" "State" "Type" "API" "Org"
## [8] "geometry"
However, they do have a shared relationship by way of space!
So, we’ll use a spatial relationship query to figure out the census tract that each school is in, then associate the tract’s data with that school (as additional data in the school’s row). This is a spatial join!
In this case, let’s say we’re interested in the relationship between the median household income in a census tract (tracts_acs_sf_ac$med_hhinc) and a school’s Academic Performance Index (schools_gdf$API).
To start, let’s take a look at the distributions of our two variables of interest.
head(tracts_acs_sf_ac)
## Simple feature collection with 6 features and 52 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS: NAD83
## GEOID STATEFP COUNTYFP TRACTCE AFFGEOID NAME.x LSAD ALAND
## 1 06001400100 06 001 400100 1400000US06001400100 4001 CT 6894340
## 2 06001400200 06 001 400200 1400000US06001400200 4002 CT 586561
## 3 06001400300 06 001 400300 1400000US06001400300 4003 CT 1105851
## 4 06001400400 06 001 400400 1400000US06001400400 4004 CT 715616
## 5 06001400500 06 001 400500 1400000US06001400500 4005 CT 590307
## 6 06001400600 06 001 400600 1400000US06001400600 4006 CT 297856
## AWATER NAME.y c_race c_white c_black
## 1 0 Census Tract 4001, Alameda County, California 3115 2055 128
## 2 0 Census Tract 4002, Alameda County, California 2025 1436 59
## 3 0 Census Tract 4003, Alameda County, California 5000 3458 380
## 4 0 Census Tract 4004, Alameda County, California 3843 2551 229
## 5 0 Census Tract 4005, Alameda County, California 3786 1885 990
## 6 0 Census Tract 4006, Alameda County, California 1638 817 343
## c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 1 592 104 6 1 400100 3501 200893
## 2 183 178 6 1 400200 1902 160536
## 3 535 364 6 1 400300 1481 94732
## 4 373 420 6 1 400400 1624 113036
## 5 264 334 6 1 400500 1627 103846
## 6 144 124 6 1 400600 1640 127232
## c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 1 1297 1158 139 3084 2514 202 120
## 2 855 532 323 2012 1827 40 46
## 3 2441 1057 1384 4948 4159 223 289
## 4 1750 653 1097 3754 3440 113 133
## 5 1622 605 1017 3745 3065 309 180
## 6 657 283 374 1587 1221 152 146
## c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 1 219 29 1542 864 165 271 0 10
## 2 39 60 1211 500 40 426 62 57
## 3 156 121 2975 1252 177 835 202 171
## 4 46 22 2293 933 240 588 188 92
## 5 145 46 2325 983 156 619 177 94
## 6 68 0 1022 370 120 268 93 3
## year p_white p_black p_asian p_latinx p_owners p_renters p_stay
## 1 2018 0.6597111 0.04109149 0.19004815 0.03338684 0.8928296 0.1071704 0.8151751
## 2 2018 0.7091358 0.02913580 0.09037037 0.08790123 0.6222222 0.3777778 0.9080517
## 3 2018 0.6916000 0.07600000 0.10700000 0.07280000 0.4330193 0.5669807 0.8405416
## 4 2018 0.6638043 0.05958886 0.09705959 0.10928962 0.3731429 0.6268571 0.9163559
## 5 2018 0.4978870 0.26148970 0.06973059 0.08821976 0.3729963 0.6270037 0.8184246
## 6 2018 0.4987790 0.20940171 0.08791209 0.07570208 0.4307458 0.5692542 0.7693762
## p_movelocal p_movecounty p_movestate p_moveabroad p_car p_carpool
## 1 0.06549935 0.03891051 0.07101167 0.009403372 0.5603113 0.10700389
## 2 0.01988072 0.02286282 0.01938370 0.029821074 0.4128819 0.03303055
## 3 0.04506871 0.05840744 0.03152789 0.024454325 0.4208403 0.05949580
## 4 0.03010123 0.03542888 0.01225360 0.005860416 0.4068905 0.10466638
## 5 0.08251001 0.04806409 0.03871829 0.012283044 0.4227957 0.06709677
## 6 0.09577820 0.09199748 0.04284814 0.000000000 0.3620352 0.11741683
## p_transit p_bike p_walk geometry
## 1 0.1757458 0.00000000 0.006485084 MULTIPOLYGON (((-122.2469 3...
## 2 0.3517754 0.05119736 0.047068538 MULTIPOLYGON (((-122.2574 3...
## 3 0.2806723 0.06789916 0.057478992 MULTIPOLYGON (((-122.2642 3...
## 4 0.2564326 0.08198866 0.040122111 MULTIPOLYGON (((-122.2618 3...
## 5 0.2662366 0.07612903 0.040430108 MULTIPOLYGON (((-122.2694 3...
## 6 0.2622309 0.09099804 0.002935421 MULTIPOLYGON (((-122.2681 3...
hist(tracts_acs_sf_ac$med_hhinc)
hist(schools_sf$API)
Oh, right! Those pesky schools with no reported APIs (i.e. API == 0)! Let’s drop those.
schools_sf_api = schools_sf[schools_sf$API > 0, ]
hist(schools_sf_api$API)
Much better!
Now, maybe we think there ought to be some correlation between the two variables? As a first pass at this possibility, let’s overlay the two datasets, coloring each one by its variable of interest. This should give us a sense of whether or not similar values co-occur.
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(tracts_acs_sf_ac) +
tm_polygons(col = 'med_hhinc',
border.col="white",
palette = 'RdYlGn',
style="jenks") +
tm_shape(schools_sf_api) +
tm_dots(col = 'API',
palette = 'RdYlGn',
border.col="black",
style="jenks",
size = 0.15)
Though it’s hard to say for sure, it certainly looks possible. It would be ideal to scatterplot the variables! But in order to do that, we need to know the median household income in each school’s tract, which means we definitely need our spatial join!
We’ll first take a look at the documentation for the spatial join function, st_join.
?st_join
Looks like the key arguments to consider are:
the two sf data.frames to be spatially joined (x and y)
the type of join to execute (left=), which is a left join if TRUE (default), or an inner join if FALSE
the spatial relationship query to use in the join (join=), which by default is st_intersects
NOTES:
By default st_join is a left join
By default st_join maintains the geometries of the first data.frame input to the operation (i.e. the geometries of x).
When spatially joining two
sfdataframes withst_join, the spatial dataframe whose geometry you want to keep should be listed first!
Question
sf data.frame are we joining onto which (i.e. which one is getting the other one’s data added to it)?sf data.frame should be x, which should be y, and should left be TRUE or FALSE?Alright! Let’s run our join!
schools_jointracts = st_join(schools_sf_api, tracts_acs_sf_ac)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# We don't need to specify default arguments!
#schools_jointracts = st_join(schools_sf_api, tracts_acs_sf_ac, left=T, join=st_within)
Question
As always, we want to sanity-check our intermediate result before we rush ahead.
One way to do that is to introspect the structure of the result object a bit.
print(dim(schools_jointracts)) # the join output
## [1] 325 60
print(dim(schools_sf_api)) # the input schools
## [1] 325 8
print(dim(tracts_acs_sf_ac)) # the input tracts
## [1] 360 53
head(schools_jointracts)
## Simple feature collection with 6 features and 59 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -122.2616 ymin: 37.739 xmax: -122.2348 ymax: 37.76911
## Geodetic CRS: NAD83
## Site Address City State Type API Org
## 1 Amelia Earhart Elementary 400 Packet Landing Rd Alameda CA ES 933 Public
## 2 Bay Farm Elementary 200 Aughinbaugh Way Alameda CA ES 932 Public
## 3 Donald D. Lum Elementary 1801 Sandcreek Way Alameda CA ES 853 Public
## 4 Edison Elementary 2700 Buena Vista Ave Alameda CA ES 927 Public
## 5 Frank Otis Elementary 3010 Fillmore St Alameda CA ES 894 Public
## 6 Franklin Elementary 1433 San Antonio Ave Alameda CA ES 893 Public
## GEOID STATEFP COUNTYFP TRACTCE AFFGEOID NAME.x LSAD
## 1 06001428302 06 001 428302 1400000US06001428302 4283.02 CT
## 2 06001428302 06 001 428302 1400000US06001428302 4283.02 CT
## 3 06001428500 06 001 428500 1400000US06001428500 4285 CT
## 4 06001427100 06 001 427100 1400000US06001427100 4271 CT
## 5 06001428200 06 001 428200 1400000US06001428200 4282 CT
## 6 06001427900 06 001 427900 1400000US06001427900 4279 CT
## ALAND AWATER NAME.y c_race
## 1 2234425 474770 Census Tract 4283.02, Alameda County, California 7501
## 2 2234425 474770 Census Tract 4283.02, Alameda County, California 7501
## 3 624197 383522 Census Tract 4285, Alameda County, California 3152
## 4 1025446 168187 Census Tract 4271, Alameda County, California 3768
## 5 1362067 804736 Census Tract 4282, Alameda County, California 6643
## 6 825214 0 Census Tract 4279, Alameda County, California 5031
## c_white c_black c_asian c_latinx state_fips county_fips tract_fips med_rent
## 1 3114 273 3070 562 6 1 428302 1946
## 2 3114 273 3070 562 6 1 428302 1946
## 3 1293 268 940 378 6 1 428500 1873
## 4 2446 67 574 457 6 1 427100 1767
## 5 3468 387 1589 686 6 1 428200 1855
## 6 2554 228 1407 563 6 1 427900 1712
## med_hhinc c_tenants c_owners c_renters c_movers c_stay c_movelocal
## 1 147667 2540 2015 525 7436 6705 395
## 2 147667 2540 2015 525 7436 6705 395
## 3 88333 1422 485 937 3125 2641 282
## 4 135833 1420 1040 380 3724 3498 121
## 5 103781 2641 1468 1173 6587 6155 205
## 6 102077 2030 734 1296 4971 4470 108
## c_movecounty c_movestate c_moveabroad c_commute c_car c_carpool c_transit
## 1 99 175 62 3812 2595 296 409
## 2 99 175 62 3812 2595 296 409
## 3 102 100 0 1514 910 65 374
## 4 86 0 19 1755 986 136 303
## 5 71 141 15 3391 2189 229 510
## 6 244 130 19 3102 1705 380 624
## c_bike c_walk year p_white p_black p_asian p_latinx p_owners
## 1 18 73 2018 0.4151446 0.03639515 0.4092788 0.07492334 0.7933071
## 2 18 73 2018 0.4151446 0.03639515 0.4092788 0.07492334 0.7933071
## 3 50 18 2018 0.4102157 0.08502538 0.2982234 0.11992386 0.3410689
## 4 33 64 2018 0.6491507 0.01778132 0.1523355 0.12128450 0.7323944
## 5 51 108 2018 0.5220533 0.05825681 0.2391992 0.10326660 0.5558501
## 6 41 52 2018 0.5076526 0.04531902 0.2796661 0.11190618 0.3615764
## p_renters p_stay p_movelocal p_movecounty p_movestate p_moveabroad
## 1 0.2066929 0.9016945 0.05311996 0.01331361 0.02353416 0.008337816
## 2 0.2066929 0.9016945 0.05311996 0.01331361 0.02353416 0.008337816
## 3 0.6589311 0.8451200 0.09024000 0.03264000 0.03200000 0.000000000
## 4 0.2676056 0.9393126 0.03249194 0.02309345 0.00000000 0.005102041
## 5 0.4441499 0.9344163 0.03112191 0.01077881 0.02140580 0.002277213
## 6 0.6384236 0.8992154 0.02172601 0.04908469 0.02615168 0.003822169
## p_car p_carpool p_transit p_bike p_walk
## 1 0.6807450 0.07764953 0.1072928 0.004721931 0.01915005
## 2 0.6807450 0.07764953 0.1072928 0.004721931 0.01915005
## 3 0.6010568 0.04293263 0.2470277 0.033025099 0.01188904
## 4 0.5618234 0.07749288 0.1726496 0.018803419 0.03646724
## 5 0.6455323 0.06753170 0.1503981 0.015039811 0.03184901
## 6 0.5496454 0.12250161 0.2011605 0.013217279 0.01676338
## geometry
## 1 POINT (-122.2388 37.74476)
## 2 POINT (-122.2519 37.739)
## 3 POINT (-122.2589 37.76206)
## 4 POINT (-122.2348 37.76525)
## 5 POINT (-122.2381 37.75396)
## 6 POINT (-122.2616 37.76911)
Confirmed! The output of the our st_join operation is an sf data.frame (schools_jointracts) with:
sf data.framesLet’s also take a look at an overlay map of the schools on the tracts. If we color the schools categorically by their tracts IDs, then we should see that all schools within a given tract polygon are the same color.
tm_shape(tracts_acs_sf_ac) +
tm_polygons(col='white', border.col='black') +
tm_shape(schools_jointracts) +
tm_dots(col='GEOID', size=0.2)
## Warning: Number of levels of the variable "GEOID" is 208, which is
## larger than max.categories (which is 30), so levels are combined. Set
## tmap_options(max.categories = 208) in the layer function to show all levels.
Fantastic! That looks right!
Now we can create that scatterplot we were thinking about!
plot(schools_jointracts$med_hhinc, schools_jointracts$API,
xlab = 'median household income ($)',
ylab = 'API')
Wow! Just as we suspected based on our overlay map, there’s a pretty obvious, strong, and positive correlation between median household income in a school’s tract and the school’s API.
Instead of joining everything in the ACS data to the schools you can just pick one var!
#simple spatial join
schools_jointracts2 = st_join(schools_sf_api, tracts_acs_sf_ac['med_hhinc'])
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
head(schools_jointracts2)
## Simple feature collection with 6 features and 8 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -122.2616 ymin: 37.739 xmax: -122.2348 ymax: 37.76911
## Geodetic CRS: NAD83
## Site Address City State Type API Org
## 1 Amelia Earhart Elementary 400 Packet Landing Rd Alameda CA ES 933 Public
## 2 Bay Farm Elementary 200 Aughinbaugh Way Alameda CA ES 932 Public
## 3 Donald D. Lum Elementary 1801 Sandcreek Way Alameda CA ES 853 Public
## 4 Edison Elementary 2700 Buena Vista Ave Alameda CA ES 927 Public
## 5 Frank Otis Elementary 3010 Fillmore St Alameda CA ES 894 Public
## 6 Franklin Elementary 1433 San Antonio Ave Alameda CA ES 893 Public
## med_hhinc geometry
## 1 147667 POINT (-122.2388 37.74476)
## 2 147667 POINT (-122.2519 37.739)
## 3 88333 POINT (-122.2589 37.76206)
## 4 135833 POINT (-122.2348 37.76525)
## 5 103781 POINT (-122.2381 37.75396)
## 6 102077 POINT (-122.2616 37.76911)
We just saw that a spatial join in one way to leverage the spatial relationship between two datasets in order to create a new, composite dataset.
An aggregation is another way we can generate new data from this relationship. In this case, for each feature in one dataset we find all the features in another dataset that satisfy our chosen spatial relationship query with it (e.g. within, intersects), then aggregate one or more of the joined output variables using some summary function (e.g. count, mean).
In our spatial join exercise the output suggested that API scores are related to median household income. However, many census tracts have more than one school. So, a mean API score per census tract would be a better way to explore this relationship. Let’s use a spatial data aggregation to calculate that value.
tracts_meanAPI = sf:::aggregate.sf(x=schools_sf_api['API'], by=tracts_acs_sf_ac, FUN=mean)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
head(tracts_meanAPI)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS: NAD83
## API geometry
## 1 864 MULTIPOLYGON (((-122.2469 3...
## 2 703 MULTIPOLYGON (((-122.2574 3...
## 3 NA MULTIPOLYGON (((-122.2642 3...
## 4 892 MULTIPOLYGON (((-122.2618 3...
## 5 NA MULTIPOLYGON (((-122.2694 3...
## 6 NA MULTIPOLYGON (((-122.2681 3...
Let’s plot the results
# plot the tracts, coloring them by *mean* API
tm_shape(tracts_acs_sf_ac) +
tm_polygons(col = 'med_hhinc',
border.col='white',
palette = 'RdYlGn',
style = 'jenks',
title = 'Median HHInc by Tract') +
# Now plot the tracts as points, colored by med HHI
tm_shape(tracts_meanAPI) +
tm_dots(col='API',
border.col='black',
palette='RdYlGn',
style='jenks',
size=0.1,
title='Mean API by Tract')
tracts_acs_sf_ac so that we can create a scatter plot of the data.#simple spatial join of one var
# first rename the col before the join bc alread have an API col
colnames(tracts_meanAPI) <- c('mean_API','geometry')
tracts_acs_sf_ac2 = st_join(tracts_acs_sf_ac, tracts_meanAPI['mean_API'])
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
plot(tracts_acs_sf_ac2$med_hhinc, tracts_acs_sf_ac2$mean_API,
xlab = 'median household income ($)',
ylab = 'Mean API')
In this exercise, you will practice spatial aggregation and spatial joins.
schools_sf) within each census tract with the sum function.tracts_acs_sf_ac2Note: to make this easier, we will add a dummy variable “count” to the schools data and set it to 1.
schools_sf$school_count = 1 # Add a new variable
head(schools_sf)# Take a look
## Simple feature collection with 6 features and 8 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -122.2616 ymin: 37.739 xmax: -122.2348 ymax: 37.76911
## Geodetic CRS: NAD83
## Site Address City State Type API Org
## 1 Amelia Earhart Elementary 400 Packet Landing Rd Alameda CA ES 933 Public
## 2 Bay Farm Elementary 200 Aughinbaugh Way Alameda CA ES 932 Public
## 3 Donald D. Lum Elementary 1801 Sandcreek Way Alameda CA ES 853 Public
## 4 Edison Elementary 2700 Buena Vista Ave Alameda CA ES 927 Public
## 5 Frank Otis Elementary 3010 Fillmore St Alameda CA ES 894 Public
## 6 Franklin Elementary 1433 San Antonio Ave Alameda CA ES 893 Public
## geometry school_count
## 1 POINT (-122.2388 37.74476) 1
## 2 POINT (-122.2519 37.739) 1
## 3 POINT (-122.2589 37.76206) 1
## 4 POINT (-122.2348 37.76525) 1
## 5 POINT (-122.2381 37.75396) 1
## 6 POINT (-122.2616 37.76911) 1
# YOUR CODE HERE
# Aggregate the number of schools (`schools_sf`) within each census tract with the `sum` function
# And save to a new sf dataframe called `school_counts_by_tract`
school_counts_by_tract = sf:::aggregate.sf(x=schools_sf['school_count'], by=tracts_acs_sf_ac, FUN=sum)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Take a look
head(school_counts_by_tract)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS: NAD83
## school_count geometry
## 1 1 MULTIPOLYGON (((-122.2469 3...
## 2 1 MULTIPOLYGON (((-122.2574 3...
## 3 NA MULTIPOLYGON (((-122.2642 3...
## 4 2 MULTIPOLYGON (((-122.2618 3...
## 5 1 MULTIPOLYGON (((-122.2694 3...
## 6 NA MULTIPOLYGON (((-122.2681 3...
# Join the count back to the Alameda County census tract data frame `tracts_acs_sf_ac2`
tracts_acs_sf_ac2 = st_join(tracts_acs_sf_ac2, school_counts_by_tract['school_count'])
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Take a look at the output
head(tracts_acs_sf_ac2)
## Simple feature collection with 6 features and 54 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.2469 ymin: 37.84996 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS: NAD83
## GEOID STATEFP COUNTYFP TRACTCE AFFGEOID NAME.x LSAD
## 1 06001400100 06 001 400100 1400000US06001400100 4001 CT
## 1.8 06001400100 06 001 400100 1400000US06001400100 4001 CT
## 1.9 06001400100 06 001 400100 1400000US06001400100 4001 CT
## 1.10 06001400100 06 001 400100 1400000US06001400100 4001 CT
## 1.11 06001400100 06 001 400100 1400000US06001400100 4001 CT
## 1.12 06001400100 06 001 400100 1400000US06001400100 4001 CT
## ALAND AWATER NAME.y c_race
## 1 6894340 0 Census Tract 4001, Alameda County, California 3115
## 1.8 6894340 0 Census Tract 4001, Alameda County, California 3115
## 1.9 6894340 0 Census Tract 4001, Alameda County, California 3115
## 1.10 6894340 0 Census Tract 4001, Alameda County, California 3115
## 1.11 6894340 0 Census Tract 4001, Alameda County, California 3115
## 1.12 6894340 0 Census Tract 4001, Alameda County, California 3115
## c_white c_black c_asian c_latinx state_fips county_fips tract_fips
## 1 2055 128 592 104 6 1 400100
## 1.8 2055 128 592 104 6 1 400100
## 1.9 2055 128 592 104 6 1 400100
## 1.10 2055 128 592 104 6 1 400100
## 1.11 2055 128 592 104 6 1 400100
## 1.12 2055 128 592 104 6 1 400100
## med_rent med_hhinc c_tenants c_owners c_renters c_movers c_stay
## 1 3501 200893 1297 1158 139 3084 2514
## 1.8 3501 200893 1297 1158 139 3084 2514
## 1.9 3501 200893 1297 1158 139 3084 2514
## 1.10 3501 200893 1297 1158 139 3084 2514
## 1.11 3501 200893 1297 1158 139 3084 2514
## 1.12 3501 200893 1297 1158 139 3084 2514
## c_movelocal c_movecounty c_movestate c_moveabroad c_commute c_car
## 1 202 120 219 29 1542 864
## 1.8 202 120 219 29 1542 864
## 1.9 202 120 219 29 1542 864
## 1.10 202 120 219 29 1542 864
## 1.11 202 120 219 29 1542 864
## 1.12 202 120 219 29 1542 864
## c_carpool c_transit c_bike c_walk year p_white p_black p_asian
## 1 165 271 0 10 2018 0.6597111 0.04109149 0.1900482
## 1.8 165 271 0 10 2018 0.6597111 0.04109149 0.1900482
## 1.9 165 271 0 10 2018 0.6597111 0.04109149 0.1900482
## 1.10 165 271 0 10 2018 0.6597111 0.04109149 0.1900482
## 1.11 165 271 0 10 2018 0.6597111 0.04109149 0.1900482
## 1.12 165 271 0 10 2018 0.6597111 0.04109149 0.1900482
## p_latinx p_owners p_renters p_stay p_movelocal p_movecounty
## 1 0.03338684 0.8928296 0.1071704 0.8151751 0.06549935 0.03891051
## 1.8 0.03338684 0.8928296 0.1071704 0.8151751 0.06549935 0.03891051
## 1.9 0.03338684 0.8928296 0.1071704 0.8151751 0.06549935 0.03891051
## 1.10 0.03338684 0.8928296 0.1071704 0.8151751 0.06549935 0.03891051
## 1.11 0.03338684 0.8928296 0.1071704 0.8151751 0.06549935 0.03891051
## 1.12 0.03338684 0.8928296 0.1071704 0.8151751 0.06549935 0.03891051
## p_movestate p_moveabroad p_car p_carpool p_transit p_bike p_walk
## 1 0.07101167 0.009403372 0.5603113 0.1070039 0.1757458 0 0.006485084
## 1.8 0.07101167 0.009403372 0.5603113 0.1070039 0.1757458 0 0.006485084
## 1.9 0.07101167 0.009403372 0.5603113 0.1070039 0.1757458 0 0.006485084
## 1.10 0.07101167 0.009403372 0.5603113 0.1070039 0.1757458 0 0.006485084
## 1.11 0.07101167 0.009403372 0.5603113 0.1070039 0.1757458 0 0.006485084
## 1.12 0.07101167 0.009403372 0.5603113 0.1070039 0.1757458 0 0.006485084
## mean_API school_count geometry
## 1 864 1 MULTIPOLYGON (((-122.2469 3...
## 1.8 864 1 MULTIPOLYGON (((-122.2469 3...
## 1.9 864 2 MULTIPOLYGON (((-122.2469 3...
## 1.10 864 NA MULTIPOLYGON (((-122.2469 3...
## 1.11 864 NA MULTIPOLYGON (((-122.2469 3...
## 1.12 864 NA MULTIPOLYGON (((-122.2469 3...
# map the output with plot
plot(tracts_acs_sf_ac2['school_count'])